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A key step toward understanding a metagenomics data set 
is the identification of functional sequence elements within 
it, such as protein coding genes and structural RNAs. Relative 
to protein coding genes, structural RNAs are more difficult 
to identify because of their reduced alphabet size, lack of 
open reading frames and short length. Infernal is a software 
package that implements "covariance models" (CMs) for RNA 
homology search, which harness both sequence and structural 
conservation when searching for RNA homologs. Thanks to 
the added statistical signal inherent in the secondary structure 
conservation of many RNA families, Infernal is more powerful 
than sequence-only based methods such as BLAST and profile 
HMMs. Together with the Rfam database of CIVls, Infernal is a 
useful tool for identifying RNAs in metagenomics data sets. 



An important step in analyzing a metagenomic sequence data set 
is identifying functional sequence elements. This is a prerequisite 
for determining important properties of the biological environ- 
ment the sequence data were sampled from, such as the metabolic 
processes and organismal diversity present there. At least initially, 
functional sequence element identification is addressed computa- 
tionally. One class of elements, functional noncoding RNA ele- 
ments, are especially difficult to identify because they tend to 
be short, lack open reading frames and sometimes evolve rapidly 
at the sequence level even while conserving structure integral to 
their function.'"'' 

Functional RNA elements include both RNA genes (genes 
transcribed into functional untranslated RNA) and cis-reg- 
ulatory mRNA structures. RNA elements play many roles. 
Ribosomal RNAs (rRNAs) and transer RNAs (tRNAs) are 
well known and universally present in all cellular life. Bacteria, 
archaea and viruses, the organisms predominantly targeted by 
current metagenomics studies, also use numerous small RNA 
(sRNA) genes for translational and post-translational regulation,^ 
as well as many cis-regulatory RNAs such as riboswitches (struc- 
tural RNAs that respond to binding small molecule metabolites 
and control expression of nearby genes'*''). Archaea have numer- 
ous small nucleolar RNAs (snoRNAs) homologous to eukaryal 
snoRNAs that direct site-specific RNA methylation and pseu- 
douridylation.'" Many eukaryotes make extensive use of RNA 
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regulatory mechanisms via pathways related to RNA interference 
(RNAi) and micro-RNAs (miRNAs),"''^ and these will become 
relevant to metagenomic studies that target eukaryotes. These are 
only a small list of the most abundant classes of functional RNA 
elements. There are many other examples: catalytic introns, 
eukaryotic spliceosomal RNAs, RNA components of ribonu- 
cleoprotein complexes including telomerase, ribonuclease P, the 
signal recognition particle and more. 

It is striking that several of the large classes of RNAs just men- 
tioned were either discovered recently (miRNAs, riboswitches) 
or have had their numbers greatly expanded by recent analyses 
(sRNAs, snoRNAs). This highlights the relative difficulty in 
discovering and analyzing functional RNA sequences, compared 
with more well-developed methodologies for discovering and ana- 
lyzing protein coding sequences. It hints that other RNAs likely 
remain undiscovered." In this chapter, we will discuss methods for 
computationally identifying homologs of known RNA elements, 
such as riboswitches and sRNA riboregulators. Another problem of 
great interest is to discover entirely new functional RNA elements 
by computational sequence analysis,*'*"''*''^ but as other reviews 
have discussed, de novo RNA discovery (gene-finding) meth- 
ods"*'"* have high false positive rates that are difficult to estimate 
statistically. Computational methods for de novo RNA discovery 
are unsuited to high-throughput automatic analysis, and instead 
need to be used as screens that can be followed by experimental 
confirmation." In contrast, RNA homology search programs are 
sufficiently reliable, and backed by sufficiently well-curated data- 
bases of known RNA sequence families that automated large-scale 
computational metagenomic analyses are feasible. 

Exploiting conserved structure in RNA similarity searches. 
Protein homology search by amino acid primary sequence com- 
parison is powerful. At the amino acid level, BLASTP^" has no 
trouble detecting significant similarity down to about 25-30% 
amino acid sequence identity. Many protein coding regions con- 
serve this level of similarity even across the deepest divergences in 
the tree of life among archaea, bacteria and eukaryotes. 

In contrast, RNA homology search by nucleotide primary 
sequence comparison is much less able to detect distant RNA 
homologies. BLASTN typically requires about 60-65% sequence 
identity to detect a statistically significant similarity for RNAs of 
typical length. Although some RNAs are very highly conserved 
over evolution (notably large and small subunit rRNAs, which are 
readily detected by sequence comparison in all species; the so-called 
human "ultraconserved" regions included regions of rRNA^'), this 
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is not the rule. Many functional RNA homologies are undetectable 
at the primary sequence level in cross-phylum comparisons (such 
as nematode/human or fly/human), because weakly or moderately 
conserved nucleic acid sequences can diverge to the 65% identity 
level in just a few tens of millions of years. 

A striking example of the differing power in detecting pro- 
tein vs. RNA homologs by sequence analysis comes when search- 
ing for homologs of the components of some ribonucleoprotein 
(RNP) complexes. It is not uncommon to detect homologs of the 
protein components but not the RNA components of complexes 
such as the signal recognition particle, ribonuclease P, small 
nucleolar RNPs and telomerase. The interpretation upon find- 
ing only the protein component is usually (and almost certainly 
correctly) that the RNP complex is present in the organism, but 
the RNA component(s) are too difficult to detect. For example, 
the probable presence of small nucleolar RNAs in archaea was 
inferred from the presence of homologs of snoRNP protein 
components like fibrillarin well before snoRNA homologs were 
discovered. "^'^^ A similar situation can occur when identifying 
homologous cis-regulatory RNA elements (such as riboswitches) 
for clearly homologous coding genes. 

Figure 1 shows some specific anecdotal examples. These data 
are fairly typical of searching databases with protein vs. RNA que- 
ries. They demonstrate two key points about the relative difficulty 
in detecting homologs of functional RNAs. First, notice that for 
the protein coding genes, the statistical significance of the similar- 
ity (the E-value) is always much better (lower and more significant) 
when comparing their amino acid sequences rather than when 
comparing their DNA sequences, highlighting the additional sta- 
tistical power inherent in searches at the amino acid level. This is 
the reason for the recommended practice of always comparing pro- 
tein sequences at the amino acid level.^'* Second, notice that RNA 
components are usually much shorter than the coding sequence of 
the protein components, compromising statistical signal and the 
ability of primary sequence analysis (BLASTN) to resolve homolo- 
gous relationships from background. (Sequence accessions used in 
these examples are listed in Table 1.) 

What can be done about the weakness of primary sequence 
based methods for detecting functional RNAs? Some other source 
of statistical signal needs to be found for functional RNAs. Such 
a signal exists: many (though not all) functional RNAs conserve 
a distinctive RNA secondary structure. 

Of course, proteins conserve structure too. Remote homolo- 
gies invisible to primary sequence analysis often become appar- 
ent when a protein's three-dimensional structure is solved. What 
makes RNA secondary structure constraints of particular utility 
for computational sequence analysis is that they produce a simple 
and strong statistical signal of pairwise residue correlations in 
aligned RNA sequences. These correlations may be sufficiently 
obvious that they are apparent even to the naked eye (analogous 
to the obviousness of ORFs for coding gene analysis). RNA 
consensus secondary structures have been accurately inferred by 
manual "comparative sequence analysis" alone. ^^'^^ 

How much extra information does RNA secondary structure 
conservation contribute in addition to primary sequence con- 
servation? We can ask this question rigorously in the context of 



homology search applications, across a range of different types 
of RNAs. Figure 2 shows the average score of search models 
for about 150 RNA sequence families, comparing models of 
sequence conservation alone ("profile hidden Markov models," 
profile HMMs^*'^') to models of sequence plus RNA secondary 
structure conservation ("covariance models," CMs^^''"). These 
consensus models are discussed in more detail below, but for the 
present point, their salient feature is that they are built from an 
input multiple alignment of homologous sequences, and they 
represent that alignment using a scoring system that is position- 
specific, with different scores at each position derived from the 
observed frequencies of the aligned residues at that position. 

The unit of score is a "bit" (essentially the same as BLAST 
"bit scores"), which is a measure of information content."'^^ Some 
intuition can be given for what a bit means, without much mathe- 
matics. A single perfectly conserved RNA residue (probability 1.0) 
contrasted to a uniform expected background (probability 0.25) is 

log: =2 

0.25 

bits of information. You need to ask two yes/no questions to 
narrow four possibilities down to one, thus two "bits" (binary 
units) of information. A position where each residue occurs with 
equal probability (same as expected background) has zero bits 
of information. Imagine two positions that contain a covarying 
Watson-Crick base pair in which each of the four possible base 
pairs occurs with equal frequency 1/4. 

In a sequence-only model, the two positions contribute zero 
bits of information, but in a structure/sequence model, this pair 
contributes two bits of information from the pairwise correlation 
(the expected background in these columns is 1/16 for each of the 
16 possible base pairs, but only four are observed with probability 
1/4 each). In contrast, two columns that form a Watson-Crick 
base pair that is perfectly conserved (a GC with probability 1.0 
for example) always contribute four bits of information, regard- 
less of whether they are modeled together as a pair 
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1.0 
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0.0625 

or independently 

1.0 , 1.0 , 

log: +IOg: = 4 

0.25 0.25 

Thus, the best case for extracting useful sequence information 
from RNA secondary structure that could not be extracted from 
RNA primary sequence consists of covarying base pairs that are 
individually not conserved in primary sequence at all. The more 
highly conserved the aligned RNA sequences are, the more primary 
sequence information content and less covariation will be seen. 

Importantly, for local sequence alignment searches using proba- 
bilistic models, there is a direct and intuitive connection between 
the bit score and the statistical significance (E-value) of a detected 
match.^' Roughly speaking, every three or so bits of score improves 
the E-value by a factor of 10-fold (for high scores, the E-value is an 
exponential function of the bit score x; E is proportional to 2""). 
So, as a rule of thumb, extracting 10 more bits of information for a 
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Figure 1. Honnoiogy search innprovement achieved by utilizing additional infornnation for proteins and structured noncoding RNAs. Exannples of 
identifying coding region homologies by amino acid sequence vs. nucleic acid sequence comparison (BLASTP vs. BLASTN, dashed lines), compared 
with identifying RNA homologies by primary sequence vs. structure/sequence comparison (BLASTN vs. Infernal, solid lines) for several ribonucleopro- 
tein complexes. Filled circles correspond to BLASTP protein searches and Infernal RNA searches. Open circles correspond to BLASTN coding region 
searches and BLASTN RNA searches. Question marks indicate targets that were not found by the indicated search method. Each point is labeled with 
its E-value ("E") and fractional coverage ("cov"), calculated as the fraction of query positions included in the hit alignment. For each query/target 
pair, the query sequence was searched against the target genome (for coding sequence and RNA searches) or predicted proteome (for amino acid 
sequence searches) using the indicated search programs. For example, in the leftmost column, when we use the SRP protein ffh 'm E. coli as a query 
against the 6. subtilis proteome with BLASTP, the top scoring hit is to the ffh protein with an E-value of 6 x 10 and spans 95% of the full protein 
sequence. Using the coding sequence of £ co//'s ffh protein as a BLASTN target against the 6. subtilis genome returns a top hit comprising 61% of the 
ffh coding sequence with an E-value of 6 x 10"'. Thus, using the protein sequence instead of the coding sequence increases the statistical significance 
of the ffh homology match by 116 orders of magnitude, indicated by the length of the dashed line in the leftmost box of the figure. Reported E-values 
are for these single genome/proteome searches, and so would be higher for searches of larger databases. Query RNAs were selected from candidates 
found by Infernal in each listed query genome's sequence using the Rfam 11.0 CM for the appropriate family (listed below). Each query RNA was used 
to build a CM using the Infernal-imposed Rfam structure, and each CM was calibrated and used to search the target genomes. Rfam family IDs for each 
family, in row order are: RF00169, RF00001, RF00168, RF00373, RF00234, RF00174. For riboswitches, the protein components are always immediately 
downstream of the RNA components. 
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Table 1. GenBank genome and protein accessions and RNA genomic coordinates for examples from Figure 1 



Organism name 


Genome 


Protein 


Protein 


RNA 


RNA 




accession 


name 


accession 


name 


genomic coordinates | 


Methonocaldooccus jonnoschii 


NC_00909.1 


Rpp29 


NP_247439.1 


RNase P RNA 


643504-643761 


Pyrococcus horikoshii 


NC_00961.1 


Rpp29 


NP_143607.1 


RNase P RNA 


168208-168414 ' 


BgcHIus cereus 


NC_003909.8 


IvsC 


NP_0978199.1 


Lysine riboswitch 


1818638-1818452 


Bocillus subtilis 


NC_000964.3 


IvsC 


NP_390725.1 


Lysine riboswitch 


2910872-2911051 j 


5ulfolobus solfataricus 


NC_002754.1 


rpl30p 


NP_342208.1 


5S rRNA 


78064-77946 


m 

Thefivococcus kodakorensis 


NC_006624.1 


rpl30p 


YP_1 83933.1 


5S rRNA 


1769482-1769599 ' 


Bacillus subtilis 


NC_000964.3 


gImS 


NP_388059.1 


GIms riboswitch 


200006-200173 


Bocillus subtilis 


NC_000964.3 


ffh 


NP_^389480.1 


SRP RNA 


26531-26633 j 


Escherichia coli 


NC_000913.2 


ffh 


NP_417101.1 


SRP RNA 


475679-475778 


Escherichia coli 

m 


NC_000913.2 


btuB 


NP_41 8401.1 


Cobalamin riboswitch 


416407-4161597 | 


Klebsiella pneumonia 


CP000647.1 


btuB 


ABR78634.1 


Cobalamin riboswitch 


4660061-4660248 


Yersinia enterocolitica 


NC_08800.1 


btuB 


YP_01 004531.1 


Cobalamin riboswitch 


157101-157301 1 


Vibrio cholera 


NC_009457.1 


btuB 


YP_001 21 8242.1 


Cobalamin riboswitch 


2498535-2498369 


Acinetobacter baumannii 


NC_011586.1 


btuB 


YP_002320687.1 


Cobalamin riboswitch 


3485342-3485537 



homology search means shifting E-values favorably by three orders 
of magnitude. This increase in resolution doesn't matter much if 
a sequence is already readily detected by primary sequence com- 
parison (improving an already significant E-value of 10"^° to 10"^^, 
for example), but it becomes important when lifting a marginally 
insignificant E-value to significance (0.1 to 10"'', for example). 

Figure 2 shows the extra bits of information contributed by 
including RNA secondary structure in "typical" RNA search 
models. These models are all position-specific profiles built from 
alignments in the Rfam RNA families database, described below. 
There is substantial variation from family to family, but the extra 
information contributed by secondary structure is often on the 
order of 10 to 20 bits or more, depending on the length and con- 
servation of the alignment, which would be expected to improve 
E-values of homologs by about three to six orders of magnitude. 
This improvement can be seen in the results of the anecdotal 
searches of Figure 1 comparing the E-values obtained by primary 
sequence BLASTN searches to those obtained by Infernal,^'' a 
sequence+secondary structure RNA homology search tool, as we 
will discuss in more detail below. 

The conclusion here is that while primary sequence is still the 
dominant source of information for these searches, adding second- 
ary structure contributes enough information content that we can 
expect a structure+sequence method to resolve some homologs 
that were not quite resolvable by sequence analysis alone. 

Infernal: software for RNA homology search and alignment. 
Computational methods that combine RNA secondary structure 
and sequence conservation information in a single consistent statisti- 
cal model have been developed, based on probabilistic models called 
"stochastic context-free grammars" (SCEGs).^**'^"''''^' Dynamic 
programming algorithms exist for optimal alignment of SCFGs to 
target sequences, analogous to algorithms for sequence alignment 
except that SCFG algorithms are aligning by base-paired secondary 
structure in addition to sequence.^'*''^''" A particular formulation of 
SCFGs, called covariance models (CMs), was developed specifically 



for automatic construction of statistical models from input RNA 
secondary structures or input multiple alignments annotated with 
consensus RNA structure. This technology is implemented in a 
freely available software package called Infernal (www.infernal. 
janelia.org). 

A variety of other computational tools for RNA homology 
search exist besides Infernal. Some of the most popular tools 
are ERPIN,^' FASTR,"^ RSmatch,« RNAMotif ''^ RNATOPS,« 
and PatScan.**^ Infernal is one of the most generally applicable 
and powerful''^ tools and is the basis for a widely used RNA fam- 
ily database (Rfam; described below). Here we will restrict our 
discussion to Infernal. 

To demonstrate how scoring structure increases statistical 
power for RNA homology search, we used Infernal to build CMs 
and perform searches for the single sequence/structure queries in 
Figure 1 (the structures were obtained from the Rfam database, 
described below). As expected, modeling structure makes the tar- 
get RNA more distinguishable from background, as evidenced by 
the decrease in E-values between BLASTN and CM searches of 
between one and 25 orders of magnitude. 

Figure 3 provides more detail for the cobalamin (B12) ribo- 
switch example from Figure 1. It shows the Escherichia coli query 
sequence and secondary structure, and the pattern of conserva- 
tion in two different homologs found by a CM built from the 
E. coli query. Notice that although many of the residue substitu- 
tions between query and target are in the predicted loop regions, 
those that occur in a position that is base-paired are often accom- 
panied by a compensatory change in the paired position to main- 
tain a Watson-Crick or GU/UG pair. The extra information 
from the E. coli structure allows Infernal to find the homologous 
riboswitch in the Acinetobacter baumannii genome as the top 
scoring hit with a significant E-value of 8 x lO"'', despite it shar- 
ing only 55% sequence identity with the E. coli riboswitch. The 
analogous search with BLASTN does not identify the riboswitch 
homology (no reported hits). 
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Figure 2. Additional information (in bits) gained by structure/sequence profiles vs. sequence-only profiles for various RNA families. Structure/se- 
quence profiles are most advantageous for families with less primary sequence information (toward left) and more secondary structure information 
(toward top), so Rfam families that gain the most from including secondary structure terms in a homology search are those toward the upper left 
quadrant. Data shown for the 164 Rfam release 11.0 families^^ with 50 or more sequences in the "seed" alignment (with the exception of SSU rRNA 
bacteria and SSU rRNA eukarya which would have been outliers on the plot with x-axis values above 1,900 bits and y-axis values above 150 bits). For 
each family, the seed alignment was used to build two profile models, one with structure (sequence/structure profile CM model) and one without 
(sequence profile HIVIiVl model). From each model, 10000 sequences were generated and scored, and the average score per sampled sequence was 
calculated. Several of the outlying points are labeled by the name of RNA family as given by Rfam. Note that the x-axis is drawn on a log scale. IVlodels 
were built and sequences were generated and scored using Infernal version 1.1 programs cmbuild, cmemit and cmalign. A slightly modified version 
of this figure will appear in a book to be published by Springer Humana Press entitled "RNA sequence, structure and function: computational and 
bioinformatic methods," edited by Jan Gorodkin and Walter Ruzzo, in chapter 9, entitled "Annotating functional RNAs in genomes using Infernal" as 
Figure 2. This figure is included here with kind permission from Springer Science-FBusiness Media B.V. 



CMs can be built from single RNAs, but they are most 
powerful when built from a multiple sequence alignment with 
consensus secondary structure annotation. CMs implement a 
position-specific ("profile") scoring system, where each consen- 
sus single-stranded position or base pair is represented by its own 
set of four or 16 scores, and insertion/deletion scores are like- 
wise specific to each point where an insertion or deletion can 
occur. Given enough aligned sequences, a position-specific pro- 
file model can learn which residues or base pairs are highly con- 
served, what substitutions are tolerated by evolution and where 
an RNA does and does not frequently tolerate insertion and 
deletion of sequence residues or structural domains. Given only 
a single RNA sequence (as in the examples in Figs. 1 and 3), the 
CM scoring system reverts to a position-independent parameter- 
ization representing the averaged constraints on typical RNAs, 
essentially analogous to the use of score matrices in pairwise 
sequence alignment methods like BLAST.'"* 

CMs are probabilistic models, meaning that all the scoring 
parameters are probabilities rather than arbitrary scores and 
penalties. This helps in managing the complexity of setting 
a large number of parameters in an objective, automatic, and 
mathematically justified way; a consensus tRNA CM has about 



1,500 parameters and a consensus LSU rRNA CM has about 
50,000 parameters that need to be determined. Using prob- 
abilities as parameters also helps in interpreting the significance 
of potential matches in a database search, and in calculating 
confidence values (posterior probabilities) associated with each 
residue in a proposed alignment. The use of probabilistic mod- 
els for RNA structure/sequence analysis follows in the wake of 
similar techniques in primary sequence analysis, where score 
profiles (also called position-specific scoring matrices, PSSMs) 
have been made more powerful and consistent using proba- 
bilistic models called profile hidden Markov models (profile 
HMMs). 

A CM can be used for a variety of alignment and search 
tasks. For example, very large numbers of RNA sequences can 
be aligned to a single RNA structure consensus with reasonable 
accuracy and efficiency: the Ribosomal Database Project (RDP) 
uses Infernal to produce alignments of hundreds of thousands of 
small subunit (SSU) rRNAs.'" For sequence annotation, includ- 
ing metagenomic analysis, the main use of CMs is for homology 
search. 

Because Infernal requires that the user provide a consensus 
RNA secondary structure for the query RNA, and because CMs 



1174 



RNA Biology 



Volume 10 Issue 7 



C^UQACCU^ 



AO 

0-C 
•U-A. 



COo"^^AA 



query 

Escherichia coli 
Cobalamin riboswitch 
NC_000913 
4161407-4161597 



5' 



0-C 

AUA-Uqccqqcc^^ 

A I I I I I I 

A-C K 
0-U g 

uo iC 

CUOUAOCAU^ 

AO 

0-C 



0"% 

OAUOAU OAC 



jiUACUA CUO 



cuocc^u 

I I I I I _ 

_ OOUOOcC 



A "A 
A- ° 
^AUUCU^" 



uoo. 



aAU, 



iU-A, 



CCOOA^ 

''III 



COO' 



UqUOgCCU^ 



target 1 

Yersinia enterocolitica 
Cobalamin riboswitch 
NC_008800 
157075-157301 



.A A: 



o-c^^ 
8-8, 



K 

a 

IGG 
CCCU 



UUcA 

U A 
G C 

^ O C 

9GCOAU OAC CyOuCGy 
CCGCUA CUO OOuAdgAA 



cu 



"uo-c 

A-U 
0-C 

aa aa 



OCCOOUAUA 

I I I I I I 
OOCCU 

G 



A°"CC00C 

A I I I I I 

AquOGUCG^ 



'aaauucauu 



c C 

c c 

C-O 
0-C 
CU-A 




CUGAUAC 

'Ay A u U 



70% seq identity with E.coli 
CM E-value: 1e-22 
BLASTN E-value: 2e-13 



. gaAAjj 



A 



target 2 

Acinetobacter baumannii 
Cobalamin riboswitch 
NC_011586 
3485374-3485522 



coo"^^ 
o-c^«^ 

0-C 
UA-U 

5' 3' 



V 



CAAlJ 

PUCG 

c 

G 
G 

u 

A 

u, 



G XcAU"; 

CAUAUy ACC CUOCA 

uAuAAnUGO 66u6ij'3 



U 
A 
.A 



55% seq identity with E.coli 
CM E-value: 8e-4 
No BLASTN hit 



A AG* 



Figure 3. Secondary structure of three cobalamin riboswitches. Using the £ coli sequence 
as a query against their respective genomes, BLASTN detects the Y. enterocolitica cobalamin 
riboswitch with a significant E-value, but not the A. baumanii riboswitch. Infernal searches 
with a CM constructed from the E. coli sequence and structure (from the Rfam seed align- 
ment for family RF00174") find both riboswitches with increased significance values. These 
example searches are also used in Figure 1. Note that the /\. baumanii riboswitch prediction 
by Infernal is not full length, and excludes the 5' and 3' ends. Presumably, the A. baumanii ri- 
boswitch extends past the boundaries of the Infernal prediction, but is sufficiently diverged 
from the £ coli sequence and structure to not be included in the optimal hit alignment. 
Structures of the targets and percent identity figures were derived from the highest scoring 
CM alignment of each target to the query (£ coli). Sequence substitutions and insertions in 
the targets with respect to the query are shown in gray. Inserted residues with respect to 
the query are shown in lowercase. Basepairs in the Rfam annotated structure are connected 
by solid lines. All riboswitches are immediately upstream (5'; within 100 residues) of btuB 
vitamin B12 transporter protein coding genes in their respective genomes. 



are most powerful when models are built from multiple sequence 
alignments, a fair amount of work might be invested in carefully 
assembling a high-quality multiple sequence alignment annotated 
with a consensus structure. This investment may be feasible if one 
is only interested in sequence analysis of a particular RNA family, 



such as rRNA. However, if the goal is compre- 
hensive high-throughput annotation of many 
different functional RNAs, for instance as part 
of analyzing a new metagenomic sequence data 
set, it would be useful to have access to a large 
number of structure-annotated RNA alignments 
and prebuilt CMs. Much as protein domain 
databases like Pfam and SMART have collected 
on the order of 10000 protein domain sequence 
alignments for systematic profile HMM analy- 
sis, ^"'^ there is a database called Rfam that has 
systematically collected RNA alignments and 
CMs.» 

Rfam: High-throughput RNA homology 
search and annotation. The Rfam database'" 
is a curated and annotated collection of RNA 
sequence families, intended for the purpose of 
systematic, automated, high-throughput anno- 
tation of functional RNA elements in genomic 
and metagenomic sequence data. The current 
(11.0) version of Rfam contains 2208 families 
(rfam.sanger.ac.uk). Each Rfam family con- 
sists of three main components: a representative 
"seed" alignment, a covariance model (CM) 
built from the "seed" alignment and a compre- 
hensive "full" alignment. 

The "seed" alignment is intended to be a 
small, stable and curated alignment of rep- 
resentative members of the sequence family, 
annotated with a consensus RNA secondary 
structure. For example, the glycine riboswitch 
(RF000504; rfam.sanger.ac.uk/family/ 
RF00504) is represented by an alignment of 44 
RNAs. 

The "full" alignment is intended to be com- 
prehensive. It consists of an Infernal-generated 
structural alignment of all homologous RNAs 
detected by Infernal in a search, using the CM 
built from the "seed" alignment, of a composite 
DNA sequence database, RFAMSEQ, which 
now includes both genomic and metagenomic 
sequence data.'^ 

The alignments are useful for a variety of 
purposes, such as phylogenetic tree inference, 
examining the phylogenetic range over which 
a given RNA family occurs, or as a source of 
training data for other RNA structure analysis 
methods. For metagenomic analyses, the main 
application of Infernal and Rfam is homology 

1 search, and the main resource is the set of pre- 

buih Rfam CMs. 
The Infernal package (infernal.janelia.org) and Rfam CM files 
(rfam.sanger.ac.uk) can be freely downloaded and used to identify 
homologs of known functional RNAs in a metagenomics data set. 
As an example of such an analysis, we performed CM searches 
of a previously published metagenomics data set.'^ The data set 
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includes about 200000 whole genome shotgun sequencing reads 
totalling about 230 Mb derived from samples of agricultural soil 
(-140 Mb, accession AAFXOIOOOOOO) and three "whale fall" car- 
casses (-90 Mb, accessions AAFZOOOOOOOO, AAFYOIOOOOOO, 
AAGAOOOOOOOO). To simplify the analysis for our illustrative 
purposes here, we searched only for riboswitches, using the 26 
Rfam 11.0 CMs of type "cis reg; riboswitch."'^ For comparison, we 
repeated the search with BLASTN, using each individual sequence 
in the Rfam seed alignment as a BLASTN query and combin- 
ing the results to identify any significant matches.'* Additionally, 
we performed searches with Infernal vl.l using non-structured 
HMM-like models from alignments without secondary structure 
that ignore secondary structure and score only primary sequence 
conservation. (These models were created using the — noss option to 
Infernal's cmbuild program and are essentially equivalent to profile 
HMMs, so we refer to them as HMMs below). Comparison of the 
BLAST, HMM, and CM search results illustrates the relative con- 
tribution of the two main differences between BLAST and CMs: 
the use of probabilistic profiles instead of pairwise comparisons (by 
comparing BLAST and HMM results), and scoring both sequence 
and RNA structure (by comparing CM and HMM results). 

Table 2 includes the number of putative riboswitches (hits) 
with E values less than 10"' found for each family using each 
method. For the BLAST searches, E-values were multiplied 
by the number of queries per family. For example, an E-value 
reported by BLAST of 10"' for the FMN family would be cor- 
rected to 1.44 X 10"' because there were 144 BLAST queries. 
Also displayed in Table 2 are the number of hits detected by one 
method but not another for all six possible pairwise combina- 
tions of the three methods. Using the strict 10"' E-value cutoff 
Infernal CM searches found 145 total putative riboswitches in 
the soil and whale falls data set; Infernal "no structure" HMM- 
like searches found 133; and BLAST found 96. The HMM-like 
searches detected 45 hits that BLAST did not, and CM searches 
detected 16 hits that HMMs did not, indicating that using pro- 
files and additional scoring of structure both contribute signifi- 
cantly to an increased sensitivity of CMs over BLAST. Also note 
the significant difference in average coverage (fraction of the 
query sequence covered in the hit alignment) between BLAST 
(0.66) and the HMM and CM searches (0.98 and 0.97, respec- 
tively). BLAST tends to find shorter hits of high identity, while 
HMMs and CMs often return full-length hits that are more 
informative for annotation. 

We can compare these results to the published results of a sim- 
ilar analysis of riboswitch occurrence in the same metagenomic 
data set using different search methodology." Kazanov et al. used 
the pattern based search program RNA-PATTERN to identify 
candidates of 11 riboswitch families (eight of which we used in 
our analysis) in the same soil and whale falls data we analyzed. 
For the eight families in common, their pattern-based search 
detected 103 candidate riboswitches, compared with 129 identi- 
fied by CM searches at a stringent threshold. RNA-PATTERN 
detected 11 candidates that CMs did not, and CMs detected 39 
candidates that RNA-PATTERN did not. The largest differences 
were for the cobalamin family, for which CMs found 20 candi- 
dates undetected by RNA-PATTERN, and the glycine family, for 



which RNA-PATTERN found 11 candidates undetected by CMs 
using a CM E-value threshold of 10"'. Three of these 11 are found 
by the glycine riboswitch CM, but with E-values just below the 
strict threshold, ranging between 10"' and 10"'. The remaining 
eight are all immediately adjacent to glycine hits that Infernal does 
find. This suggests they are likely functional riboswitches because 
glycine riboswitches usually occur in tandem with two similar 
structure right next to each other. We found when we looked into 
this that due to an implementation detail, Infernal sometimes 
misses one of two RNAs if they appear very close together. This is 
a limitation of the software we hope to remedy in a future version. 
Currently, the best way to search for families that occur in tandem 
is to build and search with a single model of both structures simul- 
taneously. Repeating this Infernal search with this kind of model 
(built from a new "seed" alignment created by simply concatenat- 
ing two copies of the original Rfam "seed") finds highly significant 
(E < 10"'°) tandem glycine structures that completely cover all 11 
RNA-PATTERN hits missed by the original Rfam model. 

Can we trust that the statistically significant matches to the 
CM are really homologs, and that increased numbers of predic- 
tions really reflect increased detection sensitivity? That is, in the 
demonstration experiment here, where we are just counting the 
number of hits detected below some E-value threshold and assert- 
ing that these are all probable homologs, it is possible that Infernal 
is instead merely assigning incorrectly low E-values to non-homol- 
ogous sequences. One way to test the accuracy of any program's 
E-values is to search randomized non-homologous sequence; one 
expects the top-scoring random match to have an E-value on the 
order of 1 (by definition of expectation value: the number of hits 
you expect to see in this database search with a score this high 
just by chance). This sort of test is a useful control experiment to 
run whenever thinking of adopting any new search method. In 
one experiment of ours,''' involving a benchmark of 51 CMs being 
searched against a 10 megabase synthetically generated target 
sequence, the highest non-homologous hit had an E-value of 0.009, 
about what you'd expect from doing 51 independent searches (1/51 
= 0.019) if E-values were accurate. In our experience, an E-value 
threshold of 10"' is conservative. Most importantly, an indepen- 
dent benchmark of a variety of RNA similarity search methods has 
been published,''^ which generally found that CM based methods 
are the most sensitive and specific methods available. 

Limitations of CMs. Now the fine print. Users applying 
Infernal and Rfam for metagenomics analysis should be aware of 
four important limitations of CM similarity search: 

(1) Infernal is slower than BLAST. In the riboswitch example 
above, the 26 CM searches took about 45 min on one proces- 
sor, about 15 times longer than BLAST searches (3 min on one 
processor for all 2,555 pairwise searches). Repeating the Infernal 
search using all 2,208 Rfam 11.0 models against the 230 Mb data 
set would take roughly 100 h. Significant compute power (such 
as a moderate sized cluster) is required to do large scale analy- 
ses with CMs. Infernal is parallelized to use multiple threads on 
multicore computers and for use on clusters using the Message 
Passing Interface (MPI),"" although neither method was enabled 
for the searches reported here. 
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Table 2. Riboswitch search results 



# Different seqs found 1 




# Seqs found 




BLAST BLAST HMM HIVIIVI CM 


CM Avg fractional coverage 


Family 


Rfam ID 


#seed seqs 


BLAST HIVIIVI 


CM -HMM -CM -BLAST -CM -BLAST 


-HMM BLAST 


HMM CM 


FMN 


RF00050 


144 


9 9 


9 


0.90 


1.00 1.00 


TPP 


RF00059 


115 


26 38 


37 1 12 1 12 


053 


0.97 0.99 


SAM 


RF00162 


433 


14 14 


14 1 1 1 1 


0.68 


0.99 0.99 


Purine 


RF00167 


133 








1 


Lysine 


RF00168 


47 


1 


1 1 1 




1.00 1.00 


Cobalamin 


RF00174 


430 


19 38 


49 3 23 31 


11 0.48 


0.95 0.92 ' 


gImS 


RF00234 


18 


1 2 


3 1 2 


1 0.36 


1.00 1.00 


Glycine 


RF00504 


44 


11 14 


16 1 4 15 


3 0.75 


1.00 0.99 


SAM a 


RF00521 


40 


5 8 


7 3 12 


0.74 


1.00 1.00 


PreQI 


RF00522 


41 








i 


SAM-IV 


RF00634 


40 










preQ1-ll 


RF1064 


14 








i 


MOCO RNA motif 


RF01055 


160 










Mg sensor 


RF01056 


4 










SAH riboswitch 


RF01057 


52 


2 2 


2 


1.00 


1.00 1.00 


AdoCbl riboswitch 


RF01482 


7 


1 


1 1 


0.25 


i 

1 


MFR 


RF01510 


2 










AdoCbl-variant 


RF01689 


144 








1 

1 


SAM-l-IV-variant 


RF01725 


439 


3 2 


2 11 1 


1 1.00 


1.00 1.00 


SAM-SAH 


RF01727 


53 


2 2 


2 


1.00 


1.00 1.00 


SMK box 
riboswitch 


RF01767 


25 










c-di-GMP-ll 


RF01786 


54 


3 3 


3 


1.00 


1.00 1.00 


Drz-agam-1 


RF01787 


7 










Drz-agam-2-2 


RF01788 


5 








1 


SAM V 


RF01826 


6 










THF 


RF01831 


98 








1 


Total 




2555 


96 133 


145 7 4 45 4 54 


16 0.66 


0.98 0.97 


"# seed seqs," the number of sequences in the Rfam seed alig 


nments used to build each model for "CM" and "HMM' 


' columns, and number of query 



sequences for "BLAST" columns. "# seqs found," the number of hits receiving E-values better than (less than) 10"' for each search method in the 230 Mb 
soil and whale falls data set'' for each Rfam 11.0 riboswitch family. For BLAST columns, reported E-values were multiplied by the number of query 
searches as a correction for multiple tests. Only corrected E-values of less than 10 ' were counted. "# different seqs found," number of hits detected 
by one method and not detected by another for each pairwise combination of methods, for example: "BLAST-HMM"for RF00162 is 1 because one hit 
detected by BLAST was undetected by the HMM search. A blank space in a cell indicates 0. "HMM" refers to the "non-structured" HMM-like Infernal 
models, "avg fractional coverage" is the average fractional coverage of the alignment of the hit, with respect to the query. For example, a BLAST hit 
that involves positions 26 to 75 of a length 100 query sequence has a coverage of 0.5. The "total" row for these columns gives the average coverage 
over all hits to all families. 



Though still slow compared with BLAST, Infernal is much 
faster than it was just a few years ago. The current version (vl.l) 
is about 10,000 times faster than version 0.55." The speedup is 
due to heuristics, including profile HMM filtering'^"* and banded 
dynamic programming," which sacrifice a small amount of sen- 
sitivity for the increased speed. This sensitivity sacrifice, though 
small, disproportionately impacts remote homology detection.'^" 
It may be worthwhile to switch off the heuristic speedups for 
smaller scale analyses if the requisite compute power is at hand. 
Conversely, if compute power is limiting, the heuristic speedup 



parameters can be tuned for greater acceleration at a greater cost 
in sensitivity.''" Further acceleration remains a goal of Infernal 
development. 

Another computationally expensive step of CM similarity 
search is "calibrating" models in order to obtain E-values for 
search results, and to determine the appropriate filtering scheme 
for maximum speed without significant sensitivity loss. Infernal's 
cmcalibrate program must run several large computational simu- 
lations, and this takes several CPU hours for a typical sized CM. 
The CMs from the Rfam database come pre-calibrated, so Rfam 
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users do not have to pay this cost, but any custom buik models 
need to be cahbrated. 

(2) A CM models only a single user-provided RNA consensus 
structure. Many RNA structures are inferred, rather than being 
determined by crystallographic or NMR methods, so second- 
ary structure annotation may well be at least partially incorrect 
especially in large collections like Rfam, where curation of a set 
of over 2000 consensus structures is challenging. Additionally, a 
single consensus structure is unable to properly capture the evo- 
lutionary variation observed among individual homologous sec- 
ondary structures, except in a crude way (as structural deletions 
and insertions relative to the consensus). And finally, an assump- 
tion that an RNA adopts only a single secondary structure is only 
an approximation, as RNAs (like proteins) are sure to exist in an 
ensemble of different structures (perhaps bound and unbound to 
a protein or substrate). Riboswitches, for example, are a dramatic 
example of the function of an RNA depending on at least two 
distinct structural conformations. 

(3) CMs ignore some aspects of RNA structure. By their 
nature, CMs are only able to model a canonical secondary struc- 
ture consisting of exclusively nested base pairing relationships, 
meaning a set of base pairs for which no two pairs overlap in 
sequence position (no two pairs between positions i:j and k:l 
exist such that z < k < j < I). This means CMs do not model 
RNA pseudoknots, base triples, nor most other contacts found 
in RNA tertiary structure. The goal of a CM is not to model 
RNA structure completely, but rather to harness as much addi- 
tional structural information as possible for more accurate RNA 
search and alignment, while still allowing for reasonably efficient 
algorithms. Capturing yet more higher-order RNA structural 
information is possible, but it violates the constraints of SCFG- 
type probabilistic models and comes at a disproportionate cost 
in computational efficiency.^' Other methods exist for RNA 
homology search that can model RNA pseudoknots, including 
ERPIN"' and RNATOPS.« 

(4) Using a CM for non-structured RNAs is pointless. Many 
RNAs may not require a conserved structure for their function. 
For example, antisense regulatory RNAs that control gene expres- 
sion simply by basepairing to target mRNAs are acting as primary 
sequences, and they do not necessarily conserve any intramolec- 
ular secondary structure. Though CMs can model RNAs with 
no consensus base pairs (Eddy 2003), it is more practical and 
appropriate to use profile HMMs rather than CMs, avoiding the 
CMs computational costs. For this reason, Infernal's cmsearch 
program automatically detects when a query model with zero 



basepairs is being used, and employs the more efficient profile 
HMM search algorithms instead of the slower CM methods. 

Materials and Methods 

Software used: NCBI-BLAST 2.2.27+ for BLASTN and 
BLASTP and Infernal version 1.1. Default settings were used 
for all searches, with the following exceptions: a single CPU was 
used, instead of allowing multithreading, using the num threads 
1 option for BLAST and the — cpu 0 option for Infernal; and 
for BLASTN of protein coding sequences in Figure 1, word size 
was set at 8 (word_size 8) because it resulted in lower E-values 
than the default word size of 11 in some cases. The GenBank 
genome and protein accessions and RNA genomic coordinates 
for searches in Figure 1 are listed in Table L 

Conclusion 

Compensatory base pair changes in RNA sequence alignments 
are strikingly apparent even to the eye. The deeper the align- 
ment (the more sequences known to conserve roughly the same 
structure), the more the RNA structure becomes obvious by 
sequence analysis alone. Robin Gutell and coworkers were able 
to predict the secondary structure of rRNA to greater than 98% 
accuracy per base pair by essentially manual comparative analy- 
sis of careful rRNA alignments,^' and Francois Michel and Eric 
Westhof essentially predicted the structure of group I intron 
catalytic introns in much the same way.""" The automation of 
comparative RNA structure/sequence analysis is essentially 
the basis of algorithms that combine RNA secondary structure 
and sequence analysis to enable identification of more remote 
RNA homologs than primary sequence methods alone can 
achieve. These methods can be used to search metagenomics 
data sets for known families of RNAs using a combination of 
the Infernal software (infernal.janelia.org) and CMs from the 
Rfam database.'^ 
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